ggplot2
In the lecture, we showed that summary statistics alone can give a misleading picture of the data. Therefore, when working with new data, we should always make some visualizations to help us understand it. A common way to make plots in R today is with ggplot2.
ggplot2 is an R package (part of the tidyverse) for “declaratively creating graphics”.
library(tidyverse)
A FiveThirtyEight article published in 2017 claimed that higher rates of hate crimes were tied to greater income inequality.
https://fivethirtyeight.com/features/higher-rates-of-hate-crimes-are-tied-to-income-inequality/
Follow these steps to read the data into RStudio:
Another way to do this is to call read.csv() with the path to the file in quotes, for example read.csv("./data/hate_crimes.csv"). See ?read.csv for more information about this function.
hate_crimes <- read.csv("./data/hate_crimes.csv")
glimpse(hate_crimes)
## Rows: 51
## Columns: 13
## $ state <chr> "Alabama", "Alaska", "Arizona…
## $ region <chr> "South", "West", "West", "Sou…
## $ median_income <int> 42278, 67629, 49254, 44922, 6…
## $ unemployment <dbl> 0.060, 0.064, 0.063, 0.052, 0…
## $ share_population_in_metro_areas <dbl> 0.64, 0.63, 0.90, 0.69, 0.97,…
## $ share_population_with_high_school_degree <dbl> 0.821, 0.914, 0.842, 0.824, 0…
## $ share_non_citizen <dbl> 0.02, 0.04, 0.10, 0.04, 0.13,…
## $ share_white_poverty <dbl> 0.12, 0.06, 0.09, 0.12, 0.09,…
## $ gini_index <dbl> 0.472, 0.422, 0.455, 0.458, 0…
## $ share_non_white <dbl> 0.35, 0.42, 0.49, 0.26, 0.61,…
## $ share_voters_voted_trump <dbl> 0.63, 0.53, 0.50, 0.60, 0.33,…
## $ hate_crimes_splc <dbl> 0.12583893, 0.14374012, 0.225…
## $ hate_crimes_fbi <dbl> 1.8064105, 1.6567001, 3.41392…
head(hate_crimes)
## state region median_income unemployment share_population_in_metro_areas
## 1 Alabama South 42278 0.060 0.64
## 2 Alaska West 67629 0.064 0.63
## 3 Arizona West 49254 0.063 0.90
## 4 Arkansas South 44922 0.052 0.69
## 5 California West 60487 0.059 0.97
## 6 Colorado West 60940 0.040 0.80
## share_population_with_high_school_degree share_non_citizen
## 1 0.821 0.02
## 2 0.914 0.04
## 3 0.842 0.10
## 4 0.824 0.04
## 5 0.806 0.13
## 6 0.893 0.06
## share_white_poverty gini_index share_non_white share_voters_voted_trump
## 1 0.12 0.472 0.35 0.63
## 2 0.06 0.422 0.42 0.53
## 3 0.09 0.455 0.49 0.50
## 4 0.12 0.458 0.26 0.60
## 5 0.09 0.471 0.61 0.33
## 6 0.07 0.457 0.31 0.44
## hate_crimes_splc hate_crimes_fbi
## 1 0.12583893 1.8064105
## 2 0.14374012 1.6567001
## 3 0.22531995 3.4139280
## 4 0.06906077 0.8692089
## 5 0.25580536 2.3979859
## 6 0.39052330 2.8046888
Basic Format of a Plot:
data and aesthetics + plot type + options
The data, aesthetics, and plot type are all necessary to create a plot. For example, if we specify only the data and aesthetics, we get an empty canvas: the axes are drawn, but no data appear.
ggplot(hate_crimes, aes(x=median_income, y=hate_crimes_fbi))
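The "data and aesthetics + plot type + options" format can be sketched step by step. This is only an illustration: the small data frame six_states is a stand-in name, filled with the first six rows of the head() output above so that the code runs without the CSV file, and the axis label text is an assumption.

```r
library(ggplot2)  # already attached if you ran library(tidyverse)

# First six rows of hate_crimes, copied from the head() output above.
# `six_states` is a stand-in name used only for this sketch.
six_states <- data.frame(
  state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  median_income = c(42278, 67629, 49254, 44922, 60487, 60940),
  hate_crimes_fbi = c(1.8064105, 1.6567001, 3.4139280,
                      0.8692089, 2.3979859, 2.8046888)
)

# Step 1: data and aesthetics only. This gives axes but draws no data.
base_plot <- ggplot(six_states, aes(x = median_income, y = hate_crimes_fbi))

# Step 2: add a plot type (a geom layer) so the data are drawn.
with_points <- base_plot + geom_point()

# Step 3: add options such as axis labels (label text is illustrative).
with_labels <- with_points +
  xlab("Median income") +
  ylab("Hate crimes (FBI)")

# Printing an object draws it, e.g. type `with_labels` at the console.
n_layers_base <- length(base_plot$layers)      # no geom layer yet
n_layers_points <- length(with_points$layers)  # one geom layer
```

Because each step is stored in its own object, you can print them one at a time and see what each `+` contributes.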
After specifying the data and aesthetics, we need to decide the plot type. In order to do that, we need to know the variable type(s). There are two different ways to distinguish the variables.
By function:
By value type:
ggplot(hate_crimes, aes(x=median_income)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(hate_crimes, aes(x=median_income)) +
geom_histogram(bins = 10)
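The message above suggests binwidth as an alternative to bins. As a rough sketch, the code below sets the width of each bin instead of the number of bins, and uses layer_data() to look at the bins ggplot2 computed. The six incomes come from the head() output above, and the width of 5000 is an arbitrary choice for illustration, not a recommended value.

```r
library(ggplot2)

# Median incomes of the first six states, from the head() output above.
six_states <- data.frame(
  median_income = c(42278, 67629, 49254, 44922, 60487, 60940)
)

# binwidth fixes the width of each bar (here 5000 dollars, chosen arbitrarily);
# bins would instead fix how many bars there are.
p <- ggplot(six_states, aes(x = median_income)) +
  geom_histogram(binwidth = 5000)

# layer_data() returns the values computed for the first layer,
# including the edges and count of each bin.
bins <- layer_data(p, 1)
bins[, c("xmin", "xmax", "count")]
```

Every observation falls into exactly one bin, so the counts add up to the number of rows in the data.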
ggplot(hate_crimes, aes(x=median_income)) +
geom_density()
ggplot(hate_crimes, aes(x=median_income)) +
geom_density(fill = "blue", alpha = 0.5)
ggplot(hate_crimes, aes(x=region)) +
geom_bar()
ggplot(hate_crimes, aes(x=region)) +
geom_bar() +
coord_flip()
ggplot(hate_crimes, aes(x=median_income, y=hate_crimes_fbi)) +
geom_point()
ggplot(hate_crimes, aes(x=median_income, y=hate_crimes_fbi)) +
geom_text(aes(label=state))
ggplot(hate_crimes, aes(x=median_income, y=hate_crimes_fbi)) +
geom_line()
ggplot(hate_crimes, aes(x=median_income, y=hate_crimes_fbi)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
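What is the difference between a box plot and a histogram or density plot? A box plot summarizes the data by its median, quartiles, and outliers, but it does not show the shape of the distribution (for example, whether it has one peak or two).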
ggplot(hate_crimes, aes(x=median_income)) +
geom_boxplot()
ggplot(hate_crimes, aes(y=median_income)) +
geom_boxplot()
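To see which numbers a box plot is built from, the sketch below computes the median and quartiles directly in base R. It uses only the six median incomes shown in the head() output above, so the values will differ from those for the full 51-row data set.

```r
# Median incomes of the first six states, from the head() output above.
income <- c(42278, 67629, 49254, 44922, 60487, 60940)

# The thick line inside the box is the median.
income_median <- median(income)

# The box runs from the first quartile (25%) to the third quartile (75%).
income_quartiles <- quantile(income, probs = c(0.25, 0.75))

# The interquartile range is the height (or width) of the box.
income_iqr <- IQR(income)

income_median
income_quartiles
income_iqr
```

These few numbers are all the box itself encodes, which is why a histogram or density plot is still needed to see the shape of the distribution.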
Suppose we are interested in the unemployment rate and want to see its distribution.
Suppose we want to show the relation between unemployment rate and median income.
Report an approximate median for the unemployment rate.
What if we want to visualize one continuous variable and one categorical variable? Sometimes we want to compare a variable across multiple groups, e.g., compare median income across different regions. Which plots can compare multiple groups?
These are called side-by-side plots.
ggplot(hate_crimes, aes(x=median_income, group = region)) +
geom_density()
ggplot(hate_crimes, aes(y=median_income, group = region)) +
geom_boxplot()
Wait, how can we tell which group is which?
ggplot(hate_crimes, aes(y=median_income, x = region)) +
geom_boxplot()
ggplot(hate_crimes, aes(y=median_income, group = region)) +
geom_boxplot(aes(color = region))
ggplot(hate_crimes, aes(x=median_income, group = region)) +
geom_density(aes(color = region))
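As a numeric companion to the side-by-side plots above, the group medians can also be computed directly. This is a small sketch in base R using only the six rows from the head() output, so each region has very few states here and the numbers are not the full-data medians.

```r
# First six rows of hate_crimes, copied from the head() output above.
six_states <- data.frame(
  region = c("South", "West", "West", "South", "West", "West"),
  median_income = c(42278, 67629, 49254, 44922, 60487, 60940)
)

# tapply() splits median_income by region and applies median() to each group.
median_by_region <- tapply(six_states$median_income, six_states$region, median)

median_by_region
```

Each value corresponds to the middle line of that region's box in a side-by-side box plot of the same data.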
ggplot(hate_crimes, aes(x=median_income, y=hate_crimes_fbi, group = region)) +
geom_point(aes(color = region))
ggplot(hate_crimes, aes(x=median_income, y=hate_crimes_fbi, group = region)) +
geom_point(aes(color = region, size = unemployment))
Add x-axis and y-axis labels and a title:
ggplot(hate_crimes, aes(x=gini_index, y=hate_crimes_fbi)) +
geom_point(aes(color=region, size=unemployment)) +
xlab('Gini Index') +
ylab('Hate Crimes (FBI, Sept. 2016)') +
ggtitle('The relation between Gini Index and Hate Crime rate in 2016')
Instead of putting the information for all groups into one plot, you can draw each group in its own panel.
ggplot(hate_crimes, aes(x=gini_index, y=hate_crimes_fbi)) +
geom_point(aes(color=region, size=unemployment)) +
xlab('Gini Index') +
ylab('Hate Crimes (FBI, Sept. 2016)') +
ggtitle('The relation between Gini Index and Hate Crime rate in 2016')+
facet_wrap(~region, nrow=1)
Show the unemployment rate across different regions. Use color to indicate the regions.
Show the relation between unemployment rate and FBI hate crime rate. Use size to indicate the Gini index. Make sure to include axis labels and a title.
Plot the distribution of the Gini index and put each region in a different panel.
The data set Weather contains data on weather-related
variables for several world cities.
# install.packages("mosaicData")  # run once if the package is not installed
library(mosaicData)
data(Weather)
glimpse(Weather)
## Rows: 3,655
## Columns: 25
## $ city <chr> "Auckland", "Auckland", "Auckland", "Auckland", "Aucklan…
## $ date <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-01…
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 20…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ high_temp <dbl> 68, 68, 77, 73, 69, 69, 71, 77, 69, 71, 75, 69, 71, 75, …
## $ avg_temp <dbl> 65, 66, 72, 66, 62, 63, 66, 70, 66, 66, 67, 66, 66, 68, …
## $ low_temp <dbl> 62, 64, 66, 60, 55, 57, 60, 64, 64, 62, 59, 62, 62, 62, …
## $ high_dewpt <dbl> 64, 64, 70, 66, 55, 54, 59, 72, 68, 63, 61, 66, 61, 63, …
## $ avg_dewpt <dbl> 60, 63, 67, 60, 52, 51, 54, 67, 61, 58, 58, 62, 57, 61, …
## $ low_dewpt <dbl> 55, 61, 64, 54, 48, 46, 50, 59, 55, 55, 54, 59, 54, 59, …
## $ high_humidity <dbl> 100, 100, 100, 100, 82, 88, 83, 100, 100, 88, 94, 100, 8…
## $ avg_humidity <dbl> 82, 94, 91, 76, 69, 65, 65, 92, 81, 76, 72, 87, 73, 80, …
## $ low_humidity <dbl> 68, 88, 74, 53, 56, 46, 53, 83, 64, 64, 53, 78, 64, 65, …
## $ high_hg <dbl> 30.15, 30.04, 29.80, 30.12, 30.21, 30.24, 30.24, 30.01, …
## $ avg_hg <dbl> 30.09, 29.90, 29.73, 29.90, 30.14, 30.22, 30.13, 29.79, …
## $ low_hg <dbl> 30.01, 29.80, 29.68, 29.77, 30.09, 30.18, 30.04, 29.62, …
## $ high_vis <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ avg_vis <dbl> 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
## $ low_vis <dbl> 4, 1, 1, 6, 6, 6, 6, 2, 4, 6, 6, 3, 6, 6, 6, 6, 6, 2, 6,…
## $ high_wind <dbl> 21, 33, 18, 15, 13, 17, 22, 21, 18, 20, 17, 16, 18, 17, …
## $ avg_wind <dbl> 15, 21, 12, 10, 7, 8, 12, 14, 11, 15, 10, 8, 13, 9, 7, 4…
## $ low_wind <dbl> 28, 46, NA, NA, NA, 28, 25, 28, 29, NA, 26, NA, NA, NA, …
## $ precip <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "…
## $ events <chr> "Rain", "Rain", "Rain", "Rain", NA, NA, "Rain", "Rain", …
We want to study the average temperature. First, we want to see its distribution.
ggplot(Weather, aes(x=avg_temp)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Weather, aes(x=avg_temp)) +
geom_density()
ggplot(Weather, aes(y=avg_temp)) +
geom_boxplot()
Box plots of average temperature for the different cities:
ggplot(Weather, aes(y=avg_temp, group=city)) +
geom_boxplot(aes(color=city))
ggplot(Weather, aes(y=avg_temp, x = city, group=city)) +
geom_boxplot(aes(color=city))
ggplot(Weather, aes(y=avg_temp, x = city, group=city)) +
geom_boxplot(aes(color=city)) + theme(legend.position = "none")
The average temperature may be related to the date. How can we show the relation between temperature and date?
ggplot(Weather, aes(x=date, y=avg_temp)) +
geom_point()
Different cities may have very different trends in average temperature.
ggplot(Weather, aes(x=date, y=avg_temp)) +
geom_point(aes(color=city))
Beijing <- Weather %>% filter(city=='Beijing')
ggplot(Beijing, aes(x=date, y=avg_temp)) +
geom_point()
ggplot(Beijing, aes(x=date, y=avg_temp)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
ggplot(Beijing, aes(x=date, y=avg_temp)) +
geom_smooth() +
geom_point()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Instead of the date, we can also use the month to show the relation between month and temperature. Converting month to a factor makes ggplot2 treat it as a categorical variable rather than a number.
ggplot(Beijing, aes(y=avg_temp, x = month)) +
geom_point()
ggplot(Beijing, aes(y=avg_temp, x = as.factor(month))) +
geom_point()
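The two plots above differ only in how month is stored. The sketch below shows the difference on a tiny toy data frame. The name toy and its temperature values are made up for illustration and are not taken from the Weather data.

```r
library(ggplot2)

# Hypothetical toy data: two made-up temperature readings per month.
toy <- data.frame(
  month = c(1, 1, 2, 2, 3, 3),
  avg_temp = c(30, 28, 35, 33, 45, 47)
)

# Numeric month: ggplot2 uses a continuous x-axis, so breaks such as 1.5
# may appear even though no such month exists.
p_numeric <- ggplot(toy, aes(x = month, y = avg_temp)) +
  geom_point()

# Factor month: ggplot2 uses a discrete x-axis with one position per level.
p_factor <- ggplot(toy, aes(x = as.factor(month), y = avg_temp)) +
  geom_point()

# The underlying difference is the class of the variable.
month_class <- class(toy$month)
month_levels <- levels(as.factor(toy$month))
```

The same conversion is what allows month to define separate groups, colors, or panels in a plot.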
ggplot(Beijing, aes(x=avg_temp)) +
geom_density(fill='lightblue', alpha=0.5)
ggplot(Beijing, aes(x=avg_temp, group=month)) +
geom_density(aes(color=as.factor(month), fill=as.factor(month)), alpha=0.5)
ggplot(Beijing, aes(x=avg_temp, group=month)) +
geom_density(aes(color=as.factor(month), fill=as.factor(month)), alpha=0.5) +
facet_wrap(~month, nrow=3)
We can also use the lowest temperature and highest temperature, instead of the average.
ggplot(Beijing, aes(x=low_temp, y=high_temp)) +
geom_point(aes(color=as.factor(month)))
What else can we explore? How about events?
ggplot(Beijing, aes(x=date, y=avg_temp)) +
geom_point(aes(color=events))
ggplot(Beijing, aes(x=events)) +
geom_bar() +
coord_flip()
ggplot(Beijing, aes(x=events)) +
geom_bar() +
coord_flip() +
facet_wrap(~year, nrow=2)